We propose Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that enjoys high scalability at both training and inference time. Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone. To train the model, we propose a self-supervised objective consisting of masked clip feature prediction. We apply sparsity to both the input, by extracting a random set of video clips, and to the loss function, by only reconstructing the sparse inputs. Moreover, we use dimensionality reduction by working in the latent space of a pre-trained backbone applied to single video clips. The final video representation is then obtained by ensembling the concatenated embeddings of the separate video clips with a clip-set summarization token. These techniques make our method not only extremely efficient to train, but also highly effective in transfer learning. We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and $k$-NN probing on common action classification datasets.
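The masked clip-feature prediction objective with a sparse loss can be sketched as follows; the 16-dimensional features, the fixed mask, and the `masked_prediction_loss` helper are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 8 randomly sampled clips, each represented by a
# 16-d feature extracted with a frozen pre-trained backbone.
clip_feats = rng.normal(size=(8, 16))

# Mask half of the clip positions; the model must predict the masked features.
mask = np.array([True, False] * 4)

def masked_prediction_loss(pred, target, mask):
    """Mean-squared reconstruction error computed only on the masked
    positions, mirroring the sparse loss described in the abstract."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))

# A perfect predictor incurs zero loss on the masked clips.
loss = masked_prediction_loss(clip_feats, clip_feats, mask)
```

Note that the loss touches only the masked rows, so unmasked positions contribute no gradient, which is what keeps the objective sparse.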
The AASM guidelines are the result of decades of efforts aiming at standardizing the sleep scoring procedure, with the final goal of having a commonly used methodology. The guidelines cover several aspects, from the technical/digital specifications (e.g., recommended EEG derivations) to detailed sleep scoring rules according to age. In the context of sleep scoring automation, deep learning has exhibited better performance compared to many other techniques. Usually, clinical expertise and official guidelines are fundamental to support automated sleep scoring algorithms in solving the task. In this paper, we show that a deep-learning-based sleep scoring algorithm may not need to fully exploit clinical knowledge or to strictly follow the AASM guidelines. Specifically, we demonstrate that U-Sleep, a state-of-the-art sleep scoring algorithm, can solve the scoring task even when using clinically non-recommended or non-conventional derivations, and without exploiting information about the chronological age of the subjects. We finally strengthen a well-known finding that using data from multiple data centers always results in better performance compared with training on a single cohort. Indeed, we show that this remains true even when increasing the size and the heterogeneity of the single data cohort. In all our experiments we used more than 28528 polysomnography studies from 13 different clinical studies.
We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architecture for different tasks. To tackle this problem, we propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than the CNN counterparts in the semi-supervised classification setting. Semi-ViT also enjoys the scalability benefits of ViTs, which can be readily scaled up to large-size models with increasing accuracy. For example, Semi-ViT-Huge achieves an impressive 80% top-1 accuracy on ImageNet using only 1% of the labels, which is comparable with Inception-v4 using 100% of the ImageNet labels.
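The EMA-teacher update at the heart of the semi-supervised fine-tuning stage can be sketched in one line; the momentum value and the toy parameter vectors below are assumptions for illustration only:

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """One EMA-teacher step: teacher <- m * teacher + (1 - m) * student.
    The teacher is never updated by gradients, only by this moving average."""
    return momentum * teacher + (1.0 - momentum) * student

# Toy parameters: the teacher slowly tracks the (here, fixed) student.
teacher = np.zeros(4)
student = np.ones(4)
for _ in range(1000):
    teacher = ema_update(teacher, student)
```

After 1000 steps with momentum 0.999 the teacher has covered roughly a 1 - 0.999^1000 ≈ 63% fraction of the gap to the student, which is what makes the framework smoother and more stable than directly using the student's predictions.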
In this work, we introduce a novel meta-learning method for sleep scoring based on self-supervised learning. Our approach aims at building sleep scoring models that can generalize across different patients and recording facilities, but that do not require further adaptation to the target data. To achieve this, we build our method on top of the Model-Agnostic Meta-Learning (MAML) framework by incorporating a self-supervised learning (SSL) stage, and call it S2MAML. We show that S2MAML can significantly outperform MAML. The gain in performance comes from the SSL stage, which we base on a general-purpose pseudo-task that limits overfitting to the subject-specific patterns present in the training dataset. We show that S2MAML outperforms standard supervised learning and MAML on the SC, ST, ISRUC, UCD and CAP datasets.
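For orientation, a minimal first-order sketch of the underlying MAML loop (without the SSL pseudo-task) on toy quadratic tasks; all names, learning rates, and the toy losses are hypothetical stand-ins, not the S2MAML training setup:

```python
import numpy as np

def grad(theta, task_center):
    # gradient of the toy per-task loss ||theta - task_center||^2
    return 2.0 * (theta - task_center)

def maml_outer_step(theta, tasks, inner_lr=0.1, outer_lr=0.05):
    """First-order MAML sketch: adapt to each task with one inner gradient
    step, then move the meta-parameters using the post-adaptation gradients."""
    meta_grad = np.zeros_like(theta)
    for c in tasks:
        adapted = theta - inner_lr * grad(theta, c)   # inner adaptation
        meta_grad += grad(adapted, c)                 # first-order outer gradient
    return theta - outer_lr * meta_grad / len(tasks)

theta = np.array([5.0])
tasks = [np.array([1.0]), np.array([3.0])]
for _ in range(200):
    theta = maml_outer_step(theta, tasks)
```

On these symmetric quadratics the meta-parameters contract toward the task centroid at 2.0, a starting point from which one inner step reaches any task quickly.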
Study Objectives: The inter-scorer variability in scoring polysomnograms is a well-known problem. Most of the existing automated sleep scoring systems are trained using labels annotated by a single scorer, whose subjective evaluation is transferred to the model. When annotations from two or more scorers are available, the scoring models are usually trained on the scorer consensus. The averaged scorer's subjectivity is then transferred into the model, losing information about the internal variability among the different scorers. In this study, we aim to insert the multiple knowledge of different physicians into the training procedure. The goal is to optimize the model training by exploiting the full information that can be extracted from the consensus of a group of scorers. Methods: We train two deep-learning-based models on three different multi-scored databases. We exploit the label smoothing technique together with a soft-consensus (LSSC) distribution to insert the multiple knowledge into the training procedure of the model. We introduce the averaged cosine similarity metric (ACS) to quantify the similarity between the hypnodensity graph generated by the model with LSSC and the hypnodensity graph generated by the scorer consensus. Results: The performance of the models improves on all the databases when we train the models with LSSC. We observe an increase in the ACS (up to 6.4%) between the hypnodensity graphs generated by the models trained with LSSC and those generated by the consensus. Conclusions: Our approach enables a model to better adapt to the consensus of the group of scorers. Future work will focus on further investigation of different scoring architectures.
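One plausible reading of label smoothing with a soft scorer consensus, together with the ACS metric, is sketched below; the exact LSSC formula (uniform smoothing blended with the empirical scorer distribution) is an assumption on our part, as the abstract does not specify it:

```python
import numpy as np

def soft_consensus(scorer_labels, n_classes, eps=0.1):
    """Soft target for one epoch: blend the empirical distribution of the
    scorers' labels with uniform label smoothing (hypothetical LSSC form)."""
    counts = np.bincount(scorer_labels, minlength=n_classes)
    empirical = counts / counts.sum()
    return (1.0 - eps) * empirical + eps / n_classes

def average_cosine_similarity(p, q):
    """ACS between two sequences of per-epoch class distributions
    (rows = sleep epochs, columns = stage probabilities)."""
    num = np.sum(p * q, axis=1)
    den = np.linalg.norm(p, axis=1) * np.linalg.norm(q, axis=1)
    return float(np.mean(num / den))

# Three scorers label one 30-second epoch (classes 0..4 = W, N1, N2, N3, REM):
# two vote N2 (class 2), one votes N1 (class 1).
target = soft_consensus(np.array([2, 2, 1]), n_classes=5)
acs = average_cosine_similarity(target[None, :], target[None, :])
```

The soft target preserves the disagreement between scorers (N2 gets the largest mass, N1 a smaller one) instead of collapsing it to a single hard consensus label.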
We propose a novel task: rendering a sharp video from new viewpoints given a single motion-blurred image of a face. Our method handles the complexity of face blur by implicitly learning the geometry and motion of faces through joint training on three large datasets: FFHQ, 300VW, and a new Bern Multi-View Face Dataset (BMFD) that we built. The first two datasets provide a large variety of faces and allow our model to generalize better. BMFD instead allows us to introduce multi-view constraints, which are crucial for synthesizing sharp videos from a new camera view. It consists of high-frame-rate synchronized videos from multiple views of several subjects displaying a wide range of facial expressions. We use the high-frame-rate videos to simulate realistic motion blur through averaging. Thanks to this dataset, we train a neural network to reconstruct a 3D video representation from a single image and the corresponding face gaze. We then provide a camera viewpoint relative to the estimated gaze, together with the blurry image, as input to an encoder-decoder network to generate a video of sharp frames from the novel camera viewpoint. We demonstrate our approach on test subjects of our multi-view dataset and on VIDTIMIT.
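Simulating motion blur by averaging a high-frame-rate burst, as described for BMFD, can be illustrated on a toy translating square; the burst length and image size here are arbitrary choices, not the dataset's actual parameters:

```python
import numpy as np

def simulate_motion_blur(frames):
    """Average a burst of consecutive high-frame-rate frames to synthesize
    a realistically motion-blurred image."""
    return np.mean(frames, axis=0)

# Toy burst: a bright 2x2 square translating one pixel per frame.
frames = np.zeros((5, 8, 8))
for t in range(5):
    frames[t, 2:4, t:t + 2] = 1.0

blurred = simulate_motion_blur(frames)
```

In the result the square's intensity is smeared along its motion path: pixels covered by several frames of the burst are brighter than those touched only once, exactly the streaking pattern of real motion blur.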
Optimization is often cast as a deterministic problem, where the solution is found through some iterative procedure such as gradient descent. However, when training neural networks the loss function changes over (iteration) time due to the randomized selection of a subset of the samples. This randomization turns the optimization problem into a stochastic one. We propose to consider the loss as a noisy observation with respect to some reference optimum. This interpretation of the loss allows us to adopt Kalman filtering as an optimizer, as its recursive formulation is designed to estimate unknown parameters from noisy measurements. Moreover, we show that the Kalman filter dynamical model for the evolution of the unknown parameters can be used to capture the gradient dynamics of advanced methods such as Momentum and Adam. We call this stochastic optimization method KOALA, which stands for Kalman Optimization Algorithm with Loss Adaptivity. KOALA is an easy-to-implement, scalable, and efficient method to train neural networks. We provide convergence analysis and show experimentally that it yields parameter estimates on par with or better than existing state-of-the-art optimization algorithms across several neural network architectures and machine learning tasks, such as computer vision and language modeling.
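The core idea, recursively estimating unknown parameters from noisy measurements, can be shown in one dimension; this is a generic scalar Kalman filter with a random-walk dynamics model, not KOALA's actual state and measurement model:

```python
import numpy as np

def kalman_step(x, P, z, Q=1e-4, R=0.25):
    """One scalar Kalman update: predict under random-walk dynamics
    (process noise Q), then correct the estimate x toward the noisy
    measurement z (measurement noise R)."""
    P = P + Q                      # predict: inflate uncertainty
    K = P / (P + R)                # Kalman gain
    x = x + K * (z - x)            # correct toward the measurement
    P = (1.0 - K) * P              # shrink posterior uncertainty
    return x, P

rng = np.random.default_rng(0)
true_param, x, P = 2.0, 0.0, 1.0
for _ in range(500):
    z = true_param + rng.normal(scale=0.5)  # noisy observation
    x, P = kalman_step(x, P, z)
```

The recursion needs no history of past measurements, only the current estimate and its uncertainty, which is what makes a Kalman-style optimizer cheap enough for training loops.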
In this paper we study the problem of image representation learning without human annotation. By following the principles of self-supervision, we build a convolutional neural network (CNN) that can be trained to solve Jigsaw puzzles as a pretext task, which requires no manual labeling, and then later repurposed to solve object classification and detection. To maintain compatibility across tasks we introduce the context-free network (CFN), a siamese-ennead CNN. The CFN takes image tiles as input and explicitly limits the receptive field (or context) of its early processing units to one tile at a time. We show that the CFN includes fewer parameters than AlexNet while preserving the same semantic learning capabilities. By training the CFN to solve Jigsaw puzzles, we learn both a feature mapping of object parts as well as their correct spatial arrangement. Our experimental evaluations show that the learned features capture semantically relevant content. Our proposed method for learning visual representations outperforms state-of-the-art methods in several transfer learning benchmarks.
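Generating a Jigsaw-puzzle training example (shuffled tiles plus a permutation-index label) can be sketched as follows; the tiny two-element permutation set is a stand-in for the paper's much larger set of maximally distinct permutations:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_jigsaw_example(image, permutations, grid=3):
    """Cut an image into a grid x grid set of tiles and shuffle them with a
    permutation drawn from a fixed set; the pretext task is then to predict
    the permutation index (a classification label) from the shuffled tiles."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    tiles = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
             for i in range(grid) for j in range(grid)]
    idx = int(rng.integers(len(permutations)))
    return [tiles[p] for p in permutations[idx]], idx

# Hypothetical permutation set: identity and full reversal.
perms = [list(range(9)), [8, 7, 6, 5, 4, 3, 2, 1, 0]]
image = np.arange(81, dtype=float).reshape(9, 9)
tiles, label = make_jigsaw_example(image, perms)
```

Each tile is later fed to one branch of the siamese network, so no branch ever sees more context than a single tile, matching the CFN's restricted receptive field.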
We are witnessing a widespread adoption of artificial intelligence in healthcare. However, most of the advancements in deep learning (DL) in this area consider only unimodal data, neglecting other modalities, even though a multimodal interpretation is necessary for supporting diagnosis, prognosis and treatment decisions. In this work we present a deep architecture, explainable by design, which jointly learns modality reconstructions and sample classifications using tabular and imaging data. The explanation of the decision taken is computed by applying a latent shift that simulates a counterfactual prediction, revealing the features of each modality that contribute the most to the decision, together with a quantitative score indicating the modality importance. We validate our approach in the context of the COVID-19 pandemic using the AIforCOVID dataset, which contains multimodal data for the early identification of patients at risk of severe outcome. The results show that the proposed method provides meaningful explanations without degrading the classification performance.
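A latent-shift counterfactual can be sketched with a finite-difference gradient on a toy linear model; the encoder, decoder, classifier, and shift magnitude here are all hypothetical stand-ins for the paper's learned networks:

```python
import numpy as np

def latent_shift_explanation(encode, decode, classify, x, lam=1.0, eps=1e-4):
    """Latent-shift sketch: move the latent code against the gradient of the
    classifier score and decode a counterfactual; the input-space difference
    highlights the features driving the decision."""
    z = encode(x)
    # finite-difference gradient of the classifier score w.r.t. z
    g = np.array([(classify(decode(z + eps * e)) - classify(decode(z - eps * e)))
                  / (2 * eps)
                  for e in np.eye(z.size)])
    counterfactual = decode(z - lam * g)
    return counterfactual - x  # per-feature attribution map

# Toy linear model: the "classifier" only looks at the first feature,
# so only that feature should receive a non-zero attribution.
encode = lambda x: x.copy()
decode = lambda z: z.copy()
classify = lambda x: float(x[0])

attribution = latent_shift_explanation(encode, decode, classify,
                                       np.array([1.0, 1.0, 1.0]))
```

As expected, only the feature the classifier actually uses is changed by the shift; in the multimodal setting, aggregating such attributions per modality yields a modality-importance score.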
Human Activity Recognition (HAR) is one of the core research areas in mobile and wearable computing. With the application of deep learning (DL) techniques such as CNNs, recognizing periodic or static activities (e.g., walking, lying, cycling, etc.) has become a well-studied problem. What remains a major challenge, though, is the sporadic activity recognition (SAR) problem, where activities of interest tend to be non-periodic and occur less frequently when compared with the often large amount of irrelevant background activities. Recent works suggested that sequential DL models (such as LSTMs) have great potential for modeling non-periodic behaviours, and in this paper we study some LSTM training strategies for SAR. Specifically, we propose two simple yet effective LSTM variants, namely the delay model and the inverse model, for two SAR scenarios (with and without a time-critical requirement). For time-critical SAR, the delay model can effectively exploit predefined delay intervals (within tolerance) in the form of contextual information for improved performance. For the regular SAR task, the second proposed variant, the inverse model, can learn patterns from the time series in an inverse manner, which can be complementary to the forward model (i.e., the LSTM), and combining both can boost the performance. These two LSTM variants are very practical, and they can be deemed training strategies that require no alteration of the LSTM fundamentals. We also study some additional LSTM training strategies, which can further improve the accuracy. We evaluated our models on two SAR and one non-SAR datasets, and the promising results demonstrate the effectiveness of our approaches in HAR applications.
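The forward/inverse ensembling idea can be sketched with stand-in scorers; the sigmoid functions below are hypothetical placeholders for the trained LSTMs, and only the time reversal and the averaging reflect the strategy described above:

```python
import numpy as np

def forward_model(window):
    """Stand-in for a forward LSTM: a probability computed from the
    window read left-to-right (here, driven by the last sample)."""
    return float(1.0 / (1.0 + np.exp(-window[-1])))

def inverse_model(window):
    """Inverse-model sketch: the same scorer applied to the time-reversed
    window, i.e. learning patterns from the series in an inverse manner."""
    return forward_model(window[::-1])

def combined_score(window):
    """Average the forward and inverse predictions; the abstract argues
    the two directions are complementary."""
    return 0.5 * (forward_model(window) + inverse_model(window))

window = np.array([0.0, 0.2, 0.5, 1.0])
score = combined_score(window)
```

Crucially, the inverse model needs no change to the LSTM itself, only to the order in which the time series is presented, which is why the variant counts as a training strategy rather than an architectural modification.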